Abstract: The task of the binary classification problem is to 
determine which of two distributions has generated a length-$n$ 
test sequence. The two distributions are unknown; however, two 
training sequences of length $N$, one from each distribution, are 
observed. The distributions share an alphabet of size $m$, which is 
significantly larger than $n$ and $N$. How do $N$, $n$, $m$ affect the 
probability of classification error? We characterize the achievable 
error rate in a high-dimensional setting in which $N$, $n$, $m$ all tend 
to infinity and $\max\{n, N\} = o(m)$. The results are: 

1) There exists an asymptotically consistent classifier if and 
only if $m = o(\min\{N^2, Nn\})$. 

2) The best achievable probability of classification error 
decays as $-\log(P_e) = J \min\{N^2, Nn\}(1 + o(1))/m$ with 
$J > 0$ (shown by achievability and converse results). 

3) A weighted coincidence-based classifier has a non-zero 
generalized error exponent $J$. 

4) The $\ell_2$-norm based classifier has a zero generalized error 
exponent. 

Index Terms — high-dimensional model, large deviations, clas- 
sification, sparse sample, generalized error exponent 

I. INTRODUCTION 

Consider the following binary classification problem: Two 
training sequences $X = \{X_1, \ldots, X_N\}$ and $Y = \{Y_1, \ldots, Y_N\}$ 
generated from two different unknown sources are observed. 
The two sources share the same alphabet $[m] := \{1, \ldots, m\}$. 
Given a test sequence $Z = \{Z_1, \ldots, Z_n\}$, the classifier 
decides whether $Z$ comes from the first source or the second. 

The performance of a classifier is usually assessed by how 
its probability of classification error depends on $N$, $n$, $m$. Since 
the exact formula for the probability of error is usually 
complicated, asymptotic models and performance criteria are 
used. For example, the classical error exponent criterion 
characterizes the exponential rate at which the probability of 
error decays as $n$ increases to infinity. In addition to assessing 
a particular test's performance, it is desirable to establish 
fundamental limits on the best achievable performance. 

In many applications such as text classification, the number 
of training and test samples observed, $N$ and $n$, are much 
smaller than the size of the alphabet $m$. This is the so-called sparse 
sample problem. For example, suppose we want to decide, 
given two articles written by two different authors, which author 
wrote a third article. The number of words appearing in an 
article is much smaller than the English vocabulary, and the 
histogram of words is a sparse one [1]. 

The high-dimensional setting, in which $N$, $n$, $m$ all tend to 
infinity and $m$ is much larger than $N$, $n$, is a useful approach to 
analyze classifiers for the sparse sample problem. A widely 
used performance criterion is asymptotic consistency: Given 
some dependence of $N$, $n$ on $m$, does the probability of error 
decay to zero as $m$ increases to infinity? A fundamental result 
with respect to this criterion was established in [2]: Assuming 
that the values of the probability mass function of all symbols 
in the alphabet are of order $1/m$, there exists an asymptotically 
consistent classifier if and only if $m = o(n^2)$. Note that the 
result is established only for the case $N = n$. 

In most practical scenarios, the number of test samples 
available is smaller than the number of training samples. It 
is thus desirable to understand how $N$ and $n$ affect the 
test performance individually. We thus pose the following 
questions: 

1) How fast do $N$ and $n$ need to increase with $m$ in order 
to have an asymptotically consistent classifier? 

2) Does the probability of error depend on $N$ and $n$ in the 
same way? 

3) If the number of training samples is limited, can the test 
performance be improved by having more test samples? 

The goal of this paper is to answer these questions by 
establishing achievability and converse results on the best achievable 
probability of classification error. Our tool is the generalized 
error exponent analysis technique from [3]. In this prior work, 
the sparse sample goodness of fit problem is investigated, in 
which the number of test samples is much smaller than the 
size of the alphabet. The classical error exponent was extended to 
the sparse sample case via a different scaling in the large deviation 
analysis. 

In the classification problem, the classical error exponent 
analysis has been applied to the case of a fixed alphabet in 
[?] and [?]. It was shown that in order for the probability 
of error to decay exponentially fast with respect to $n$, the 
number of training samples $N$ must grow at least linearly 
with $n$. However, in the sparse sample case, the classical error 
exponent concept is again not applicable, and thus a different 
scaling is needed. 

We identify the appropriate scaling in this paper, and thereby 
obtain a generalized error exponent to approximate the 
probability of error for large but sparse observations. This analysis 
yields new insights on the best achievable performance: 

1) The numbers of training and test samples $N$, $n$ have 
different effects on the test performance, made precise 
in Theorem IV.1 and Theorem IV.2. 

2) The $\ell_2$-norm based test investigated in [2], which 
compares the $\ell_2$ distances from the empirical distribution 
of the test sequence to those of the two training 
sequences, is sub-optimal in that it has a zero generalized 
error exponent, while a weighted coincidence-based test 
proposed in this paper has a non-zero generalized error 
exponent. 

Related work: Two problems that are closely related to the 
sparse sample classification problem are the goodness of fit 
problem and the problem of testing whether two distributions 
are close. For the goodness of fit problem, achievability and 
converse results with respect to different criteria have been 
established in [?], [?], [?], [?], [?]. For the problem of testing 
the closeness of two distributions, achievability and converse 
results with respect to asymptotic consistency have been 
established in [10], [11]. Algorithms for testing the closeness 
of two distributions are naturally applicable for classification 
problems. However, it is not clear how they compare to the 
$\ell_2$-norm based classifier and the weighted coincidence-based 
classifier. 

II. Notation and Model 

Consider the following classification problem: Two training 
sequences $X$ and $Y$ are generated i.i.d. with marginal 
distributions $\pi$ and $\mu$, respectively. Each symbol in the sequences takes 
value in $[m] := \{1, 2, \ldots, m\}$. A test sequence $Z$ is observed. 
The sequence $Z$ is i.i.d. with marginal distribution $\pi$ under the 
null hypothesis $H_0$ and with marginal $\mu$ under the alternative 
hypothesis $H_1$. The three sequences $X$, $Y$, $Z$ are independent. 

Denote the set of probability distributions over $[m]$ by 
$\mathcal{P}([m])$. The pair of unknown distributions $(\pi, \mu)$ belongs to 
the following set $\Pi_m \subset \mathcal{P}([m]) \times \mathcal{P}([m])$, 

$$\Pi_m = \Big\{ (\pi, \mu) : \|\mu - \pi\|_1 \ge \varepsilon, \ \max_i \pi_i \le \frac{\eta}{m}, \ \max_i \mu_i \le \frac{\eta}{m} \Big\},$$

where $\eta$ is a large positive constant. The definition of $\Pi_m$ 
is essentially the same as the $\alpha$-large-alphabet source defined 
in [2], except that we allow the number of training and test 
samples to be different. 
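
The constraints defining $\Pi_m$ can be checked mechanically. The following is a minimal sketch, not part of the original analysis: the function name `in_Pi_m`, the small numerical tolerance, and the three example distributions are all illustrative assumptions.

```python
def in_Pi_m(pi, mu, eps, eta):
    """Check the three constraints defining Pi_m for two pmfs on [m]."""
    m = len(pi)
    l1 = sum(abs(a - b) for a, b in zip(mu, pi))  # ||mu - pi||_1
    tol = 1e-12  # tolerance for floating-point round-off (an assumption)
    return (l1 >= eps - tol
            and max(pi) <= eta / m + tol
            and max(mu) <= eta / m + tol)

m, eps, eta = 8, 0.5, 2.0
u = [1.0 / m] * m
# Half of the symbols get mass (1+eps)/m, the other half (1-eps)/m.
q = [(1 + eps) / m] * (m // 2) + [(1 - eps) / m] * (m // 2)
# One frequent symbol: violates the cap max_i mu_i <= eta/m.
spiky = [0.5] + [0.5 / (m - 1)] * (m - 1)

print(in_Pi_m(u, q, eps, eta))      # separated pair, all symbols rare
print(in_Pi_m(u, u, eps, eta))      # identical distributions
print(in_Pi_m(u, spiky, eps, eta))  # contains a non-rare symbol
```

The third example shows what the cap $\eta/m$ excludes: a distribution with even one frequent symbol falls outside the model.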

The assumption that $\max_j \pi_j \le \eta/m$, $\max_j \mu_j \le \eta/m$ indicates 
that we are interested in how the existence of a large number 
of rare symbols affects the test performance, and is motivated 
by the English vocabulary. Extending the results to the case 
where there are both rare and non-rare symbols is a topic 
currently under investigation. 



In the high-dimensional model, we consider a sequence of 
classification problems as described above, indexed by $m$. 
Thus $\mathcal{P}([m])$, $N$, $n$, $\pi$, $\mu$, $\Pi_m$ all depend on $m$. Moreover, $N$, $n$ 
increase to infinity as $m$ increases. 

A classifier $\phi = \{\phi_m\}_{m \ge 1}$ is a sequence of binary-valued 
functions with $\phi_m : [m]^N \times [m]^N \times [m]^n \to \{0, 1\}$. It decides 
in favor of $H_1$ if $\phi_m = 1$ and $H_0$ otherwise. Use the notation 
$P_{(\pi,\mu,\nu)}(A)$ to denote the probability of the event $A$ when 
$X$, $Y$ and $Z$ have marginal distributions $\pi$, $\mu$, $\nu$ respectively. 
The performance of a test $\phi$ is evaluated using the worst-case 
average probability of error given by 

$$P_e(\phi_m) = \sup_{(\pi,\mu) \in \Pi_m} \Big[ \tfrac{1}{2} P_{(\pi,\mu,\pi)}\{\phi_m = 1\} + \tfrac{1}{2} P_{(\pi,\mu,\mu)}\{\phi_m = 0\} \Big].$$

A test is said to be asymptotically consistent if 

$$\lim_{m \to \infty} P_e(\phi_m) = 0.$$

In this paper, we are interested in the sparse sample case and 
thus impose the following assumption on the growth rate of 
N and n: 

Assumption 1. N = o(m),n = o(m). 

III. Asymptotic Consistency 

We begin with the asymptotic consistency result. 

Theorem III.1. Suppose Assumption 1 holds. There exists an 
asymptotically consistent classifier if and only if 

$$m = o(\min\{N^2, Nn\}).$$

The "if" direction is a corollary of Theorem IV.1 and the 
"only if" direction is a corollary of Theorem IV.2. 
We have a few remarks: 

1) For the case $N = n$, the conclusion of Theorem III.1 
is consistent with the results in [2, Theorems 3 and 4].$^1$ 
Our proof technique is different. 

2) The requirements on $N$ and $n$ for asymptotic 
consistency are different: The first requirement $m = o(N^2)$ 
needs to be satisfied regardless of how many test samples 
are available. The second requirement is active only 
when $n = O(N)$. Therefore, as long as the number of 
test samples grows linearly with the training samples, 
further increasing the number of test samples will not 
improve the performance in terms of asymptotic 
consistency. 

3) On the other hand, increasing the number of training 
samples will always improve the performance. The effect 
of increasing the training samples is different when $n = 
O(N)$ and when $N = o(n)$. 

4) The weighted coincidence-based classifier that achieves 
the performance described in Theorem III.1 does not 
require the value of $m$ as an input. 

$^1$The achievability result in [2, Theorem 3] does not require Assumption 1. 
We conjecture that Assumption 1 is also unnecessary for the conclusion in 
Theorem III.1 under the additional assumption that $n \to \infty$. 



IV. Generalized Error Exponent 

When $m$ is fixed, the following error exponent criterion has 
been used to evaluate a classifier $\phi$: 

$$I(\phi) := -\limsup_{n \to \infty} \frac{1}{n} \log(P_e(\phi)). \tag{1}$$

In the sparse sample case where $N, n = o(m)$, the classical 
error exponent criterion given in (1) is no longer applicable 
since it is zero for all possible classifiers (see Theorem IV.2). 
One should consider instead the following generalization, 
defined with respect to the normalization $r(N, n, m)$: 

$$J(\phi) := -\limsup_{m \to \infty} \frac{1}{r(N, n, m)} \log(P_e(\phi_m)). \tag{2}$$

The results in Theorem IV.1 and Theorem IV.2 imply that the 
appropriate normalization is 

$$r(N, n, m) = \min\{N^2, Nn\}/m. \tag{3}$$

The generalized error exponent $J(\phi)$ could depend on how 
$N$, $n$ increase with $m$. Note that to have a consistent classifier, 
the necessary condition in Theorem III.1 must be satisfied, as 
summarized in the assumption below: 

Assumption 2. $m = o(\min\{N^2, Nn\})$. 

This is equivalent to $\lim_{m \to \infty} r(N, n, m) = \infty$. 
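
To see how the normalization behaves, the sketch below evaluates $r(N, n, m)$ under power-law scalings $N = m^a$, $n = m^b$. The power-law form, the exponents, and the helper name `sizes` are illustrative assumptions and do not come from the analysis above.

```python
import math

def r(N, n, m):
    """Normalization r(N, n, m) = min{N^2, N n} / m from (3)."""
    return min(N * N, N * n) / m

def sizes(m, a, b):
    """Hypothetical power-law scaling N = m^a, n = m^b (rounded up)."""
    return math.ceil(m ** a), math.ceil(m ** b)

# (a, b) pairs: the first satisfies Assumption 2, the other two do not.
for a, b in [(0.75, 0.5), (0.75, 0.2), (0.4, 0.9)]:
    values = []
    for m in (10**4, 10**6, 10**8):
        N, n = sizes(m, a, b)
        values.append(round(r(N, n, m), 3))
    print(f"N = m^{a}, n = m^{b}: r = {values}")
```

In the last case $Nn/m$ grows but $N^2/m$ does not, so $r$ still decays: a large number of test samples cannot compensate for too few training samples.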

The following theorems demonstrate that the definition in 
(2) with the scaling in (3) is meaningful: 

Theorem IV.1 (Achievability). Suppose Assumption 1 and 
Assumption 2 hold. Then there exists a classifier $\phi$ such that 

$$J(\phi) > 0.$$

Theorem IV.2 (Converse). Suppose Assumption 1 holds. 
There exists a constant $J$ such that for any classifier $\phi$, and 
any $m$, 

$$-\log(P_e(\phi_m)) \le r(N, n, m) J.$$

These theorems imply that the best achievable probability 
of error decays approximately as $P_e \approx \exp\{-r(N, n, m) J\}$ 
for some $J > 0$. Note that the probability of error changes 
exponentially with respect to $n$ only when $n = O(N)$. When 
$N = o(n)$, the probability of error is mainly determined by 
the number of training samples. This phenomenon is similar 
to the case with fixed $m$, for which results in [3] show that 
whether $n = O(N)$ holds determines whether the probability 
of error decreases exponentially in $n$. 

V. $\ell_2$-Norm Based Classifier Has a Zero 
Generalized Error Exponent 

Let $a_j^z$ be the number of times that the $j$th symbol appears in 
$Z$. The vectors $a^x$ and $a^y$ are defined similarly. 

The $\ell_2$-norm based classifier has the following test statistic: 

$$F_n := \Big\| \frac{1}{N} a^x - \frac{1}{n} a^z \Big\|_2 - \Big\| \frac{1}{N} a^y - \frac{1}{n} a^z \Big\|_2.$$

The classifier is given by 

$$\phi_F = \mathbb{I}\{F_n > 0\}.$$
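
As an illustration, the statistic can be computed directly from symbol counts. This is a minimal sketch under the assumption that $F_n$ is the difference of the two $\ell_2$ distances between empirical distributions; the function names and the toy sequences are hypothetical. Squaring both norms would not change the sign of $F_n$, and hence not the decision.

```python
from collections import Counter
import math

def l2_statistic(x, y, z):
    """F_n = ||a^x/N - a^z/n||_2 - ||a^y/N - a^z/n||_2."""
    ax, ay, az = Counter(x), Counter(y), Counter(z)
    Nx, Ny, n = len(x), len(y), len(z)
    # Symbols never observed contribute zero to both norms.
    support = set(ax) | set(ay) | set(az)
    dx = math.sqrt(sum((ax[j] / Nx - az[j] / n) ** 2 for j in support))
    dy = math.sqrt(sum((ay[j] / Ny - az[j] / n) ** 2 for j in support))
    return dx - dy

def l2_classifier(x, y, z):
    """Return 1 (decide H1: Z matches Y) if F_n > 0, else 0 (decide H0)."""
    return 1 if l2_statistic(x, y, z) > 0 else 0

x = [1, 1, 2, 3]
y = [4, 5, 5, 6]
print(l2_classifier(x, y, [5, 5]))  # test sequence overlaps with y
print(l2_classifier(x, y, [1, 1]))  # test sequence overlaps with x
```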



This classifier was shown in [2] to be asymptotically consistent 
when $N = n$ and $m = o(n^2)$. We now show, however, that this 
classifier has a zero generalized error exponent: 

Theorem V.1. Suppose $N = n$ and Assumption 1 and 
Assumption 2 hold. Assume in addition that $m = o(n^2 / \log(n)^2)$. 
Then 

$$J(\phi_F) = 0.$$

The sub-optimality of $\phi_F$ is due to the following reason: For 
any $j$, a large variation of the value of $a_j^z$ causes a significant 
change in the value of the statistic $F_n$. Fix some $\omega$. Let $q = 
q^\omega$ be the bi-uniform distribution defined in (7), and let $u$ 
denote the uniform distribution on $[m]$. Consider the case where 
under $H_0$, the distributions of $X$, $Y$, $Z$ are given by $(q, u, q)$. 

Consider the following event where one symbol appears 
many times: 

$$C_n := \{ a_j^z = \lfloor 4n/\sqrt{m} \rfloor \}. \tag{4}$$

We claim that this event is likely to cause a false alarm: 

$$P_{(q,u,q)}\{\phi_F = 1 \mid C_n\} = 1 - o(1).$$

On the other hand, the probability of $C_n$ decays slowly: 

$$P_{(q,u,q)}(C_n) = \exp\{-4 (n/\sqrt{m}) \log(m)(1 + o(1))\}. \tag{5}$$

Combining these two equalities gives the lower bound 

$$\log(P_e(\phi_F)) \ge \log\big(\tfrac{1}{2} P_{(q,u,q)}(C_n) P_{(q,u,q)}\{\phi_F = 1 \mid C_n\}\big) = -4 \frac{n}{\sqrt{m}} \log(m)(1 + o(1)).$$

Thus $-\log(P_e(\phi_F))$ grows at most as $n m^{-1/2} \log(m)$, which is slower than 
$n^2/m$. Consequently, $J(\phi_F) = 0$. 
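
The gap between the two rates can be checked numerically. The sketch below compares $n m^{-1/2} \log(m)$ with $n^2/m$ under the scaling $n = m^{0.75}$, which is an illustrative assumption chosen to satisfy $m = o(n^2/\log(n)^2)$; the function names are hypothetical.

```python
import math

def l2_rate(n, m):
    """Exponent scale n * m^(-1/2) * log(m) associated with the event C_n."""
    return n * math.log(m) / math.sqrt(m)

def optimal_rate(n, m):
    """Normalization r(n, n, m) = n^2 / m for the case N = n."""
    return n * n / m

ratios = []
for m in (10**4, 10**6, 10**8, 10**10):
    n = round(m ** 0.75)  # hypothetical scaling of n with m
    ratios.append(l2_rate(n, m) / optimal_rate(n, m))
    print(f"m={m:.0e}  n={n}  ratio={ratios[-1]:.4f}")
```

The ratio equals $\sqrt{m}\log(m)/n$, so it vanishes exactly when the additional condition of Theorem V.1 holds (up to the difference between $\log m$ and $\log n$, which are of the same order here).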

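VI. Proof of Achievability: Weighted 
Coincidence-Based Classifier 

A nonzero generalized error exponent is achieved by the 
following weighted coincidence-based classifier, whose 
construction is inspired by the weighted coincidence-based test 
proposed in [3]. Define the test statistic $T_n$: 

$$T_n = \sum_j \Big[ \frac{1}{N^2} \mathbb{I}\{a_j^x = 2, a_j^z = 0\} + \frac{1}{n^2} \mathbb{I}\{a_j^x = 0, a_j^z = 2\} - \frac{1}{Nn} \mathbb{I}\{a_j^x = 1, a_j^z = 1\} + \frac{1}{Nn} \mathbb{I}\{a_j^y = 1, a_j^z = 1\} - \frac{1}{n^2} \mathbb{I}\{a_j^y = 0, a_j^z = 2\} - \frac{1}{N^2} \mathbb{I}\{a_j^y = 2, a_j^z = 0\} \Big].$$

The classifier is given by $\phi_T = \mathbb{I}\{T_n > 0\}$. 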
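
A weighted coincidence statistic of this kind can be sketched as follows. The weights $1/N^2$, $1/(Nn)$ and $1/n^2$ on the six coincidence indicators are an assumption made for this illustration, and the function names and toy sequences are hypothetical; the sketch shows the structure of the test rather than a verified transcription of the statistic.

```python
from collections import Counter

def coincidence_statistic(x, y, z):
    """Weighted count of coincidences between z and each training sequence.

    Assumes len(x) == len(y) == N. The weights below are an assumption.
    """
    ax, ay, az = Counter(x), Counter(y), Counter(z)
    N, n = len(x), len(z)
    t = 0.0
    # Every indicator needs a count of 1 or 2, so observed symbols suffice.
    for j in set(ax) | set(ay) | set(az):
        t += (ax[j] == 2 and az[j] == 0) / N**2
        t += (ax[j] == 0 and az[j] == 2) / n**2
        t -= (ax[j] == 1 and az[j] == 1) / (N * n)
        t += (ay[j] == 1 and az[j] == 1) / (N * n)
        t -= (ay[j] == 0 and az[j] == 2) / n**2
        t -= (ay[j] == 2 and az[j] == 0) / N**2
    return t

def coincidence_classifier(x, y, z):
    """Return 1 (decide H1) if T_n > 0, else 0 (decide H0)."""
    return 1 if coincidence_statistic(x, y, z) > 0 else 0

x = [1, 2, 3, 4]
y = [5, 6, 7, 8]
print(coincidence_classifier(x, y, [5, 9]))  # z shares a symbol with y
print(coincidence_classifier(x, y, [1, 9]))  # z shares a symbol with x
```

Note that the function only iterates over symbols that were actually observed, so the alphabet size $m$ never enters the computation.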

Theorem IV.1 is proved by bounding $P_e(\phi_T)$ via the Chernoff bound: 

$$\log(P_{(\pi,\mu,\pi)}\{\phi_T = 1\}) \le \inf_{\theta \ge 0} \Lambda_{(\pi,\mu,\pi)}(\theta),$$

$$\log(P_{(\pi,\mu,\mu)}\{\phi_T = 0\}) \le \inf_{\theta \le 0} \Lambda_{(\pi,\mu,\mu)}(\theta),$$

where $\Lambda_{(\pi,\mu,\nu)}(\theta) = \log E_{(\pi,\mu,\nu)}[\exp(\theta T_n)]$ is the logarithmic 
moment generating function of $T_n$. The main step is to obtain 
an asymptotic approximation to $\Lambda_{(\pi,\mu,\nu)}(\theta)$, given in the 
following proposition: 



Proposition VI.1. Let $\theta = \min\{N^2, nN\} \gamma$. For $\gamma = O(1)$, 

\k,ij.,v) (#) 
.min{N 2 ,nN} 



<- 



: (7E(i(^-^) a -i(/*i-^) a )] 

m 

3=1 



0( 



minjiV 2 , nN} max{A^, n} 



) + 0(l). 



Proposition VI.1 is obtained using the Poissonization 
technique: The distribution of the vector $a^x$ is the same as 
the conditional distribution of a vector of Poisson random 
variables whose expected values are given by $\lambda \pi$ for some 
constant $\lambda > 0$, conditioned on the event that the sum of these 
random variables is equal to $N$. The main steps are similar to 
those used for results in [?]. 
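
The Poissonization identity can be verified exactly on a small example. The sketch below compares the multinomial probability of every count vector with the corresponding conditional Poisson probability; the distribution `pi`, the length `N`, and the rate `lam` are arbitrary illustrative values.

```python
import math
from itertools import product

def poisson_pmf(k, rate):
    return math.exp(-rate) * rate**k / math.factorial(k)

def multinomial_pmf(counts, probs):
    """Probability that N i.i.d. draws from probs yield the given counts."""
    coef = math.factorial(sum(counts))
    for k in counts:
        coef //= math.factorial(k)
    return coef * math.prod(p**k for p, k in zip(probs, counts))

def conditional_poisson_pmf(counts, probs, lam):
    """P(independent Poisson(lam * p_j) counts | their sum equals N)."""
    N = sum(counts)
    joint = math.prod(poisson_pmf(k, lam * p) for p, k in zip(probs, counts))
    # The sum of the independent Poissons is Poisson(lam), since probs sum to 1.
    return joint / poisson_pmf(N, lam)

pi, N, lam = (0.5, 0.3, 0.2), 4, 2.5
max_gap = max(
    abs(multinomial_pmf(c, pi) - conditional_poisson_pmf(c, pi, lam))
    for c in product(range(N + 1), repeat=len(pi))
    if sum(c) == N
)
print(max_gap)  # agreement up to floating-point round-off
```

The agreement holds for any $\lambda > 0$, which is why the constant can be chosen for convenience in the analysis.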

Applying Proposition VI.1 with the Chernoff bound for the 
cases $\nu = \pi$ and $\nu = \mu$, and using Assumption 1, 
Assumption 2, and the facts $\pi_j, \mu_j \le \eta/m$ and $\sum_{j=1}^{m} (\mu_j - \pi_j)^2 \ge \varepsilon^2/m$, 
we obtain 

$$\log(P_{(\pi,\mu,\pi)}\{\phi_T = 1\}) \le -\frac{\varepsilon^4}{160 \eta^2} \frac{\min\{N^2, nN\}}{m} (1 + o(1)),$$

$$\log(P_{(\pi,\mu,\mu)}\{\phi_T = 0\}) \le -\frac{\varepsilon^4}{160 \eta^2} \frac{\min\{N^2, nN\}}{m} (1 + o(1)).$$

Note that the error term $o(1)$ in the approximation is uniform 
over all $(\pi, \mu) \in \Pi_m$. Therefore, 

$$J(\phi_T) \ge \frac{\varepsilon^4}{160 \eta^2}.$$


VII. Proof of converse 
Step 1: Establish the upper bound 

$$-\log(P_e(\phi_m)) \le J_1 N^2/m. \tag{6}$$



The main idea of the proof is to consider an event under which 
the observations do not give any information regarding the 
hypotheses, and lower-bound the probability of such an event. 
We now make this precise. Define the event 

A = {No symbol in X appears more than once; 
no symbol in Y appears more than once.} 

Assume without loss of generality that $m$ is even. Let $u$ denote 
the uniform distribution on $[m]$. Define a collection of 
bi-uniform distributions as follows: Let $K_m$ denote the collection 
of all subsets of $[m]$ whose cardinality is $m/2$. For each set 
$\omega \in K_m$, define the distribution $q^\omega$ as 

$$q^\omega_j = \begin{cases} (1 + \varepsilon)/m, & j \in \omega; \\ (1 - \varepsilon)/m, & j \in [m] \setminus \omega. \end{cases} \tag{7}$$

Note that $\|u - q^\omega\|_1 = \varepsilon$, and $(u, q^\omega) \in \Pi_m$ for all $\omega$. 
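
The bi-uniform construction is easy to instantiate. The following sketch draws one subset $\omega$ at random and checks the stated $\ell_1$ distance; the values of `m` and `eps`, the seed, and the function name `bi_uniform` are illustrative assumptions.

```python
import random

def bi_uniform(m, eps, rng):
    """Draw omega uniformly from K_m and return (omega, q^omega) as in (7)."""
    omega = set(rng.sample(range(1, m + 1), m // 2))
    q = [(1 + eps) / m if j in omega else (1 - eps) / m
         for j in range(1, m + 1)]
    return omega, q

m, eps = 1000, 0.3
rng = random.Random(0)
omega, q = bi_uniform(m, eps, rng)
u = [1.0 / m] * m

l1 = sum(abs(a - b) for a, b in zip(u, q))      # should equal eps
l2_sq = sum((a - b) ** 2 for a, b in zip(u, q))  # should equal eps^2 / m
print(abs(sum(q) - 1.0) < 1e-9, abs(l1 - eps) < 1e-9,
      abs(l2_sq - eps**2 / m) < 1e-12)
```

The squared $\ell_2$ distance equals $\varepsilon^2/m$ for every $\omega$, so this family attains the lower bound $\sum_j (\mu_j - \pi_j)^2 \ge \varepsilon^2/m$ with equality.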

We will use the short-hand notation {(x,y,z)} = 
{(X, Y, Z) = (x, y, z)} throughout the paper. 

Our choice of the collection of distributions makes sure that 
the following result holds: 



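Lemma VII.1. For any sequence $(x, y, z) \in A$, 

$$\frac{1}{|K_m|} \sum_{\omega \in K_m} \Pr_{(u, q^\omega, u)}(x, y, z) = \frac{1}{|K_m|} \sum_{\omega \in K_m} \Pr_{(q^\omega, u, u)}(x, y, z). \tag{8}$$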



Proof sketch for Lemma VII.1: For any sequence, let $\varphi_i$ 
denote the number of symbols appearing $i$ times. The vector 
$[\varphi_1, \varphi_2, \varphi_3, \ldots]$ is called the profile of the sequence [12]. 

Because of the symmetry of the collection of distributions 
$\{q^\omega, \omega \in K_m\}$, the symmetry of the uniform distribution 
$u$, and the independence among $X$, $Y$, $Z$, the value of 
$\frac{1}{|K_m|} \sum_{\omega \in K_m} \Pr_{(u, q^\omega, u)}(x, y, z)$ only depends on the profiles 
of $x$, $y$, and $z$. In the event $A$, the profiles of $x$ and $y$ are 
fixed, which then leads to the claim of the lemma. ■ 
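
The profile of a sequence, and the event $A$ expressed through profiles, can be computed as in the sketch below. The function names are hypothetical, and the event test assumes non-empty sequences.

```python
from collections import Counter

def profile(seq):
    """Return [phi_1, phi_2, ...]: phi_i = number of symbols appearing i times."""
    multiplicities = Counter(Counter(seq).values())
    top = max(multiplicities, default=0)
    return [multiplicities[i] for i in range(1, top + 1)]

def in_event_A(x, y):
    """Event A: no symbol repeats within x and none repeats within y."""
    return profile(x) == [len(x)] and profile(y) == [len(y)]

print(profile([3, 1, 3, 7, 7, 7]))   # one singleton, one pair, one triple
print(profile([1, 1, 2]) == profile([8, 3, 3]))  # invariant to relabeling
print(in_event_A([5, 2, 9], [4, 1, 6]), in_event_A([5, 5, 9], [4, 1, 6]))
```

The relabeling invariance in the second line is the property the symmetry argument relies on: sequences with the same profile differ only by a permutation of the alphabet.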

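Lemma VII.1 implies that for any observation $(x, y, z) \in 
A$, it is impossible to tell whether it is more likely to come 
from the mixture on the left-hand side or the mixture on the 
right-hand side in (8). Consequently, 

$$\begin{aligned}
P_e(\phi_m) &\ge \frac{1}{|K_m|} \sum_{\omega \in K_m} \frac{1}{4} \Big[ P_{(u, q^\omega, u)}\{\phi_m = 1\} + P_{(u, q^\omega, q^\omega)}\{\phi_m = 0\} + P_{(q^\omega, u, q^\omega)}\{\phi_m = 1\} + P_{(q^\omega, u, u)}\{\phi_m = 0\} \Big] \\
&\ge \frac{1}{4 |K_m|} \sum_{\omega \in K_m} \Big[ \Pr_{(u, q^\omega, u)}\{\phi_m = 1\} + \Pr_{(q^\omega, u, u)}\{\phi_m = 0\} \Big] \\
&\ge \frac{1}{4 |K_m|} \sum_{\omega \in K_m} \Big[ \Pr_{(u, q^\omega, u)}(\{\phi_m = 1\} \cap A) + \Pr_{(q^\omega, u, u)}(\{\phi_m = 0\} \cap A) \Big] \\
&= \frac{1}{4 |K_m|} \sum_{\omega \in K_m} \Big[ \Pr_{(u, q^\omega, u)}(\{\phi_m = 1\} \cap A) + \Pr_{(u, q^\omega, u)}(\{\phi_m = 0\} \cap A) \Big] \\
&= \frac{1}{4 |K_m|} \sum_{\omega \in K_m} \Pr_{(u, q^\omega, u)}(A),
\end{aligned} \tag{9}$$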

where the first inequality follows from the fact that the 
maximum is no smaller than the average, and the second-to-last 
equality follows from Lemma VII.1. The probability of the 
event $A$ can be lower-bounded. 

Lemma VII.2. The following approximation holds uniformly 
for any $\omega$: 

$$\log\big(\Pr_{(u, q^\omega, u)}(A)\big) = -\Big(1 + \frac{1}{2} \varepsilon^2\Big) \frac{N^2}{m} (1 + o(1)) + O(1).$$

Proof sketch: It follows from a combinatorial argument 
that the probability that no symbol appears twice in $X$ when 
$X$ has marginal distribution $u$ is given by 

$$m(m-1) \cdots (m-N+1)(1/m)^N = \exp\Big\{-\frac{1}{2} \frac{N^2}{m} (1 + o(1))\Big\}.$$

Estimating the probability that no symbol appears twice in $Y$ 
can be done similarly but is more involved. ■ 
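
The approximation for the uniform case can be checked numerically. The sketch below computes the exact log-probability that $N$ uniform draws are distinct and compares it with $-N^2/(2m)$; the scaling $N = m^{0.75}$ is an illustrative assumption satisfying $N = o(m)$, and the function name is hypothetical.

```python
import math

def log_prob_no_repeat(N, m):
    """Exact log-probability that N i.i.d. uniform draws from [m] are distinct."""
    # log of m(m-1)...(m-N+1) / m^N, written as a sum of log(1 - i/m).
    return sum(math.log1p(-i / m) for i in range(N))

for m in (10**4, 10**6, 10**8):
    N = round(m ** 0.75)  # hypothetical scaling with N = o(m)
    exact = log_prob_no_repeat(N, m)
    approx = -N * N / (2 * m)
    print(f"m={m}  N={N}  exact={exact:.3f}  approx={approx:.3f}  "
          f"ratio={exact / approx:.4f}")
```

The ratio approaches one as $m$ grows, consistent with the $(1 + o(1))$ factor; the leading correction is of relative order $N/m$.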

The claim (6) follows from applying Lemma VII.2 to (9) 
and picking a large enough $J_1$. 

Step 2: Establish the second upper bound 

$$-\log(P_e(\phi_m)) \le J_2 (Nn + n^2)/m. \tag{10}$$



We consider the following event: 

B = {No symbol in Z appears more than once; 

no symbol in Z has appeared in either X or Y}. 

When this event happens, it is impossible (in the worst-case 
setting) to infer which distribution the test sequence is more 
likely to be generated from. This is captured by the following 
lemma: 

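Lemma VII.3. Consider any $x$, $y$. For any two sequences 
$z$ and $\bar{z}$ such that $(x, y, z) \in B$ and $(x, y, \bar{z}) \in B$, the 
following holds: 

$$\frac{1}{|K_m|} \sum_{\omega \in K_m} \Pr_{(u, q^\omega, u)}(x, y, z) = \frac{1}{|K_m|} \sum_{\omega \in K_m} \Pr_{(u, q^\omega, u)}(x, y, \bar{z}),$$

$$\frac{1}{|K_m|} \sum_{\omega \in K_m} \Pr_{(u, q^\omega, q^\omega)}(x, y, z) = \frac{1}{|K_m|} \sum_{\omega \in K_m} \Pr_{(u, q^\omega, q^\omega)}(x, y, \bar{z}).$$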

Proof sketch for Lemma VII.3: Since no symbol in $z$ has 
appeared in $x$ and $y$, due to the symmetry of the collection of 
distributions $\{q^\omega, \omega \in K_m\}$, for fixed $x$ and $y$, the value of 
$\frac{1}{|K_m|} \sum_{\omega \in K_m} \Pr_{(u, q^\omega, q^\omega)}(x, y, z)$ only depends on the profile 
of $z$. It follows from the definition of the event $B$ that the 
profile of $z$ is the same as the profile of $\bar{z}$. ■ 
The result of Lemma VII.3 can be interpreted as follows: In 
the event $B$, observing $Z$ does not give any information since 
under either hypothesis, each sequence $z$ appears with equal 
probability. 

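Consider any $x$, $y$. Let $D^1_{x,y} = \{z : (x, y, z) \in \{\phi_m = 
1\} \cap B\}$ and $D^0_{x,y} = \{z : (x, y, z) \in \{\phi_m = 0\} \cap B\}$. 
Lemma VII.3 implies that the probability of $\{X = x, Y = 
y, \phi_m = 1\} \cap B$ only depends on the size of $D^1_{x,y}$, rather 
than on which sequences the set $D^1_{x,y}$ includes. Consequently, 

$$\begin{aligned}
&\frac{1}{|K_m|} \sum_{\omega \in K_m} \Big[ \Pr_{(u, q^\omega, u)}(\{X = x, Y = y, \phi_m = 1\} \cap B) + \Pr_{(u, q^\omega, q^\omega)}(\{X = x, Y = y, \phi_m = 0\} \cap B) \Big] \\
&\quad = \frac{|D^1_{x,y}|}{|D^1_{x,y}| + |D^0_{x,y}|} \frac{1}{|K_m|} \sum_{\omega \in K_m} \Pr_{(u, q^\omega, u)}(\{X = x, Y = y\} \cap B) + \frac{|D^0_{x,y}|}{|D^1_{x,y}| + |D^0_{x,y}|} \frac{1}{|K_m|} \sum_{\omega \in K_m} \Pr_{(u, q^\omega, q^\omega)}(\{X = x, Y = y\} \cap B) \\
&\quad \ge \frac{1}{|K_m|} \min\Big\{ \sum_{\omega \in K_m} \Pr_{(u, q^\omega, u)}(\{X = x, Y = y\} \cap B), \ \sum_{\omega \in K_m} \Pr_{(u, q^\omega, q^\omega)}(\{X = x, Y = y\} \cap B) \Big\},
\end{aligned} \tag{11}$$

where the inequality follows from lower-bounding the 
probability of $\{X = x, Y = y\} \cap B$ under $(u, q^\omega, u)$ and $(u, q^\omega, q^\omega)$ 
by the minimum of these two. 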

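Lemma VII.4. Let $J_2 = 5$. The following bounds hold 
uniformly over all $\omega$, $x$, $y$: 

$$\log\big[\Pr_{(u, q^\omega, u)}(B \mid X = x, Y = y)\big] \ge -J_2 \frac{Nn + n^2}{m} (1 + o(1)),$$

$$\log\big[\Pr_{(u, q^\omega, q^\omega)}(B \mid X = x, Y = y)\big] \ge -J_2 \frac{Nn + n^2}{m} (1 + o(1)).$$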

The proof is similar to that of Lemma VII.2. 

Note that the average probability of error is no smaller than the 
summation of the left-hand side of the equality in (11) over 
all possible $(x, y)$. Applying Lemma VII.4 to lower-bound the 
right-hand side of the inequality in (11) leads to the claim. 

We now combine (6) and (10). It is straightforward to verify 
that 

$$\min\{N^2, Nn + n^2\} \le \min\{N^2, 2Nn\}.$$

We thus obtain the claim of Theorem IV.2 with $J = 
\max\{J_1, 2J_2\}$. 
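
The elementary inequality used to combine the two bounds can be confirmed by brute force over a grid. This is only a sanity check on a finite range chosen for illustration, not a proof; the function name is hypothetical.

```python
def combined_bound_holds(N, n):
    """Check min{N^2, Nn + n^2} <= min{N^2, 2Nn} <= 2 * min{N^2, Nn}."""
    lhs = min(N * N, N * n + n * n)
    mid = min(N * N, 2 * N * n)
    return lhs <= mid <= 2 * min(N * N, N * n)

violations = [(N, n)
              for N in range(1, 201)
              for n in range(1, 201)
              if not combined_bound_holds(N, n)]
print(len(violations))
```

The second inequality in the chain is what turns the factor $J_2$ into $2 J_2$ in the final constant.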

VIII. Conclusions and Future Work 

We have investigated the binary classification problem with 
sparse samples using the generalized error exponent concept. 
We have established fundamental performance limits, and 
proposed a classifier that performs better than the $\ell_2$-norm 
based classifier. Future directions include: 

1) Investigate classification algorithms that are applicable 
when there are both rare and frequent symbols. 

2) Investigate theoretical limits and algorithms for the case 
with a large number of training samples ($m = O(N)$) and 
a small number of test samples ($n = o(m)$). 

3) The generalized error exponent criterion could be also 
applicable to the problem of testing closeness of two 
distributions. 